ORA-00600 [kgeade_is_0] and kxfpg1sg

Normally when you get ORA-00600 [string], you search Metalink using the keywords "ORA-600 string" 
(no quotes; "600" and "00600" are the same on Metalink). Today I got this error many times

ORA-00600: internal error code, arguments: [kgeade_is_0], [], [], [], [], [], [], []

on my 10.2.0.4 2-node RAC database running on Linux x86_64. The trace files kept filling up the 
filesystem until I set max_dump_file_size to a very small number. I couldn't find relevant notes or 
bugs; the bugs I found are all related to renaming a file from ASM to a filesystem (see 
Note:742289.1 for Bug 7207932). I can easily reproduce my error with any query involving a gv$ 
view, even one as simple as select * from gv$instance.
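
As a small illustration of the keyword rule above, here is a minimal Python sketch that turns an 
ORA-00600 message line into the search string. The function name metalink_keywords is made up for 
this note; the regular expression only assumes the message layout shown above.

```python
import re

def metalink_keywords(error_line):
    """Build a search string such as 'ORA-600 kgeade_is_0' from an ORA-00600 line.

    Returns None if the line does not look like the message shown above.
    """
    # Drop leading zeros from the error number and capture the first [argument].
    m = re.search(r"ORA-0*(\d+):.*?arguments:\s*\[([^\]]*)\]", error_line)
    if m is None:
        return None
    code, first_arg = m.group(1), m.group(2).strip()
    return f"ORA-{code} {first_arg}".strip()

line = ("ORA-00600: internal error code, arguments: "
        "[kgeade_is_0], [], [], [], [], [], [], []")
print(metalink_keywords(line))
```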

The trace file has a long call stack:

ksedst()+31
ksedmp()+610
ksfdmp()+21
kgerinv()+161
kgeasnmierr()+163
kgeade()+509
kgerev()+58
kserec0()+186
kxfpg1sg()+1894
kxfpgsg()+1969
kxfrAllocSlaves()+351
kxfrialo()+2080
kxfralo()+313
kxfrAllocSlaves()+351
kxfrialo()+2080
kxfralo()+313
qerpx_rowsrc_start()+3844
qerpxStart()+234
selexe()+667
opiexe()+4671
kpoal8()+2273
opiodr()+984
kpoodrc()+38
rpiswu2()+420
kpoodr()+1020
upirtrc()+2164
kpurcsc()+125
kpuexecv8()+1710
kpuexecv8()+1710
kpuexec()+2602
OCIStmtExecute()+41
ktte_aggregate_finfo()+3062
ktte_monitor_tsth()+772
ktte_monitor_ts()+355
ksbcti()+1301
ksbabs()+804
kebm_mmon_main()+318
ksbrdp()+794
opirip()+616
opidrv()+582
sou2o()+114
opimai_real()+317
main()+116
__libc_start_main()+244               
_start()+41        

Ignore the number after +; it's the offset from the function address. Focus on the functions at 
the top of the stack:

ksedst -> ksedmp -> ksfdmp -> kgerinv -> kgeasnmierr -> kgeade -> kgerev -> kserec0 -> kxfpg1sg -> 
kxfpgsg

KS in KSE means Kernel Service layer. E probably means Error handling. Ksedst dumps the current 
call stack into this trace file, and ksedmp dumps the process state. Kgeade may look interesting 
because the first argument in our ORA-600 error is kgeade_is_0 (asserting that the kgeade() 
function should not return 0?). According to Bug 6954816, kgeade is "KGE ADd Error onto the error 
stack". This simply confirms that any function near the top of the stack beginning with kge is not 
worth looking into, even if it or a variant of it is the first argument in the ORA-600 error. After 
all, if the code has already reached the error handling routine, it has already passed the bad 
function that triggered this error. With that in mind, let's search for the first function (or the 
last one in order of time) that is not an error handling or dump routine, i.e. does not begin with 
kse, ksf or kge. It is kxfpg1sg in my case.
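
The search can be sketched in a few lines of Python. This is only an illustration: the function 
name first_suspect is made up, and the prefix list (kse, ksf, kge) is my assumption about which 
routines in this particular trace are error handling or dump functions. It is not an official list 
and may need adjusting for other traces.

```python
import re

# Assumption: functions with these prefixes are error handling or dump
# routines in this trace and can be skipped.
SKIP_PREFIXES = ("kse", "ksf", "kge")

def first_suspect(stack_lines, skip_prefixes=SKIP_PREFIXES):
    """Return the first function, reading top down, that is not skipped."""
    for line in stack_lines:
        # Each line looks like "kxfpg1sg()+1894"; keep only the name.
        m = re.match(r"\s*(\w+)\(\)", line)
        if m is None:
            continue
        name = m.group(1)
        if not name.lower().startswith(skip_prefixes):
            return name
    return None

# Top of the call stack from the trace file above.
stack = """ksedst()+31
ksedmp()+610
ksfdmp()+21
kgerinv()+161
kgeasnmierr()+163
kgeade()+509
kgerev()+58
kserec0()+186
kxfpg1sg()+1894
kxfpgsg()+1969""".splitlines()

print(first_suspect(stack))
```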

It didn't take me long to find out what this function is related to. It appears in one of my old 
snapshots of v$latch_misses as a value for location. The latch that uses this function is "query 
server process", which clearly indicates a relationship with the parallel execution processes that 
Oracle uses when you query a gv$ view. On Metalink, the most relevant hit may be Bug 5072023 
(ORA-00600 [15735] RUNNING A QUERY ON GV$ AND DBA_* VIEWS). Unfortunately it offers no workaround, 
but it points out that "The error ORA-600 means that the message to join the group (for the PQ 
slaves) is too large". Another good hit is Note:455202.1 (ORA-00600[15735] WHEN QUERYING A TABLE 
WHOSE PARALLEL DEGREE IS >1). Although the triggering event there is not a query on a gv$ view, the 
note suggests lowering parallel_execution_message_size. Indeed, earlier I had manually increased 
the value from its meager 2152 to 16384, the maximum for pre-11g RAC (Note:6394739.8). I lowered 
the value back and bounced the entire database. The error is no longer generated.

The moral of the story is that we should read the call stack top down, but skip ALL error handling 
or trace dumping functions, even though the first argument of the ORA-600, or ORA-7445 for that 
matter, tells you otherwise.

*************
2009-02 update:
----- Michael.B.JonesATwellsfargoDOTcom wrote:
> Your note at http://yong321.freeshell.org/oranotes/ORA-600%5Bkgeade_is_0%5D.txt proved very 
> helpful in resolving this issue.  What I found was the answer to the problem, however, was not 
> to change the parallel_execution_message_size parameter setting back to 2K, but to make the value 
> match on all nodes of our RAC database.  A setting of 8K helps processing a lot and does not 
> generate the error.  A mismatch between values on different nodes caused the error and synching 
> them stopped the core dumps immediately.

I tested and confirmed it. Thanks! In fact, the Oracle Reference has the comment "Multiple 
instances must have the same value" for this parameter.
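
A quick way to spot such a mismatch is to compare the values collected from each instance. The 
Python sketch below is only an illustration: the function name find_mismatch and the instance names 
are made up, and it assumes you read parallel_execution_message_size on each node separately (a gv$ 
query is what triggered the error in the first place) and typed the values in by hand.

```python
def find_mismatch(values_by_instance):
    """Group instances by parameter value.

    Returns {} when all instances agree, otherwise a dict mapping each
    distinct value to the sorted list of instances that use it.
    """
    groups = {}
    for instance, value in values_by_instance.items():
        groups.setdefault(value, []).append(instance)
    if len(groups) <= 1:
        return {}
    return {value: sorted(names) for value, names in groups.items()}

# Hypothetical values noted down from each node of a 2-node RAC database.
collected = {"inst1": 16384, "inst2": 2152}
print(find_mismatch(collected))
```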
*************